Team Information

Project Title - Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Group Number - 21

Names -

Kumar Saurabh (ksaurabh@iu.edu)

Shubham Thakur (sbmthakur@iu.edu)

Ameya Dalvi (abdalvi@iu.edu)

Vishwa Shrirame (vshriram@iu.edu)

Team Photos

[Team photo: WhatsApp Image 2021-11-16 at 9.49.21 PM.jpeg]

Kaggle API setup

Kaggle is a data science competition platform that hosts many datasets. In the past, submitting a result was cumbersome, as you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; it took us less than 15 minutes to finish a submission.

  1. Install library

For more detailed information on setting up the Kaggle API, see the official Kaggle API documentation.
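As a sketch, the setup might look like the following. This assumes `pip` is available and that you have downloaded an API token (`kaggle.json`) from your Kaggle account page; the token location shown is the documented default, and the download path is just an example:

```shell
# Install the Kaggle CLI
pip install kaggle

# Place the API token downloaded from your Kaggle account page
mkdir -p ~/.kaggle
cp ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

# Verify the setup by listing the competition files
kaggle competitions files home-credit-default-risk
```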

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.

The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 2018-05-19).


Data files overview

There are 7 different sources of data:

[Figure: overview of the 7 data files]

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file to the BASE_DIR
  2. If you plan to use the Kaggle API, please use the following steps.
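If you choose the Kaggle API route, the download can be scripted roughly as follows; this is a sketch that assumes the CLI setup described earlier and reuses the `DATA_DIR` shown above:

```shell
# Assumes the Kaggle CLI is installed and ~/.kaggle/kaggle.json is in place
DATA_DIR="../../../Data/home-credit-default-risk"
mkdir -p "$DATA_DIR"

# Download the competition archive and unzip it into DATA_DIR
kaggle competitions download -c home-credit-default-risk -p "$DATA_DIR"
unzip -o "$DATA_DIR/home-credit-default-risk.zip" -d "$DATA_DIR"
```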

Imports

Data files overview

Data Dictionary

The data download includes a data dictionary, named HomeCredit_columns_description.csv.

[Figure: preview of the data dictionary]

Application train

Application test

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets

Exploratory Data Analysis

Summary of Application train

Dividing the train dataset into train and validation sets
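A minimal sketch of how the split might be done with scikit-learn; the toy `app_train` DataFrame below is a stand-in for the real application_train table, and stratifying on TARGET preserves the class imbalance in both splits:

```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Toy stand-in for application_train; in practice this is read from CSV
app_train = pd.DataFrame({
    "EXT_SOURCE_1": [0.1, 0.5, 0.3, 0.9, 0.2, 0.7, 0.4, 0.6],
    "TARGET":       [0,   1,   0,   0,   1,   0,   0,   1],
})

X = app_train.drop(columns=["TARGET"])
y = app_train["TARGET"]

# Stratify on TARGET so the class ratio is preserved in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_train), len(X_val))  # 6 2
```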

Summary Statistics -

Determine the categorical and numerical features
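One common way to separate the two groups is by dtype; a sketch with illustrative column names from the application table (object columns are treated as categorical, everything else as numerical):

```python
import pandas as pd

df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, 67500.0],
    "CODE_GENDER": ["M", "F", "M"],
    "CNT_CHILDREN": [0, 1, 0],
})

# Object (string) columns -> categorical; the rest -> numerical
cat_features = df.select_dtypes(include="object").columns.tolist()
num_features = df.select_dtypes(exclude="object").columns.tolist()
print(cat_features)  # ['CODE_GENDER']
print(num_features)  # ['AMT_INCOME_TOTAL', 'CNT_CHILDREN']
```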

Missing data Analysis

Missing data for application train

Bar plot of missing values in each column

We notice that there are a lot of missing values in the dataset.

Bar plot of the percentage of missing values in each column
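The per-column missing percentages behind such a bar plot can be computed in one line; a sketch with a toy DataFrame (the column names are just examples from the application table):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "EXT_SOURCE_1": [0.5, np.nan, np.nan, 0.7],
    "OWN_CAR_AGE":  [np.nan, 3.0, np.nan, np.nan],
    "CNT_CHILDREN": [0, 1, 2, 0],
})

# Fraction of NaNs per column, expressed as a percentage, largest first
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)  # OWN_CAR_AGE 75.0, EXT_SOURCE_1 50.0, CNT_CHILDREN 0.0
# missing_pct.plot.bar() would render the bar plot
```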

Imputing Missing data

We will construct the numerical pipeline and categorical pipeline
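The two pipelines can be sketched as below, matching the strategies described in the abstract (median for numerical, most frequent for categorical); the step names are arbitrary labels:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numerical: median imputation followed by standardization
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical: most-frequent imputation followed by one-hot encoding
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
```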

Correlation Analysis

Correlation with the target column
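A sketch of how these correlations might be computed with pandas; the values here are synthetic and only illustrate the mechanics:

```python
import pandas as pd

df = pd.DataFrame({
    "EXT_SOURCE_1": [0.1, 0.8, 0.3, 0.9],
    "DAYS_BIRTH":   [-12000, -18000, -14000, -20000],
    "TARGET":       [1, 0, 1, 0],
})

# Correlation of every numeric column with TARGET, strongest first
target_corr = (
    df.corr()["TARGET"].drop("TARGET").sort_values(key=abs, ascending=False)
)
print(target_corr)
```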

Pair plot of the 4 top correlated features

We can see the plot of the top 4 correlated attributes with the Target column. EXT_SOURCE_1 seems to be normally distributed while others are skewed but can be approximated to normal distribution.

Heatmap of correlated attributes

Bar plot of correlated attributes

Visual EDA

Distribution of the target column

Evaluating categorical features with respect to TARGET

TARGET values:
* 0: loan was repaid
* 1: loan was not repaid

Categorical distribution

Numerical distribution

We notice that there are many outliers in the data, as seen in the box plots. The box plots let us visualize the median and the quantiles of each column.

The histogram plots show the distribution of the data over a range. We have visualized the distribution of each numerical column.

Applicants Age

Here we can conclude that people aged 30-50 submit the most loan applications.

Applicants occupations

Laborers apply for more loans than people in other occupation types.

Dataset questions

Unique record for each SK_ID_CURR

previous applications for the submission file

The persons in the Kaggle submission file have previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID

Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 5 previous applications (22%)
* Medium = 10 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)

Input Features -

Modeling

Baseline Logistic Regression -

The objective function for the learning a binomial logistic regression model (log loss) can be stated as follows:

$$ \underset{\mathbf{\theta}}{\operatorname{argmin}}\left[\text{CXE}\right] = \underset{\mathbf{\theta}}{\operatorname{argmin}} \left[ -\dfrac{1}{m} \sum\limits_{i=1}^{m}{\left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) \log\left(1 - \hat{p}^{(i)}\right)\right]} \right] $$

The corresponding gradient function of partial derivatives is as follows (after a little bit of math):

$$ \begin{aligned} \nabla_\text{CXE}(\mathbf{\theta}) &= \begin{pmatrix} \frac{\partial}{\partial \theta_0} \text{CXE}(\mathbf{\theta}) \\ \frac{\partial}{\partial \theta_1} \text{CXE}(\mathbf{\theta}) \\ \vdots \\ \frac{\partial}{\partial \theta_n} \text{CXE}(\mathbf{\theta}) \end{pmatrix}\\ &= \dfrac{1}{m} \mathbf{X}^T \cdot (\hat{\mathbf{p}} - \mathbf{y}) \end{aligned} $$

For completeness, learning a binomial logistic regression model via gradient descent would use the following step iteratively:

$$ \mathbf{\theta}^{(\text{next step})} = \mathbf{\theta} - \eta \nabla_\text{CXE}(\mathbf{\theta}) $$
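The loss, gradient, and update step above can be sketched in NumPy. This is a toy illustration, not our production pipeline; the data is synthetic and the learning rate and step count are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, eta=0.1, n_steps=1000):
    """Binomial logistic regression via batch gradient descent on log loss."""
    m, n = X.shape
    Xb = np.c_[np.ones(m), X]                   # prepend a bias column for theta_0
    theta = np.zeros(n + 1)
    for _ in range(n_steps):
        p_hat = sigmoid(Xb @ theta)             # predicted probabilities
        grad = (1.0 / m) * Xb.T @ (p_hat - y)   # gradient of the CXE loss
        theta -= eta * grad                     # gradient descent step
    return theta

# Toy linearly separable data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
theta = fit_logreg(X, y)
preds = (sigmoid(np.c_[np.ones(4), X] @ theta) >= 0.5).astype(int)
print(preds)  # [0 0 1 1]
```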

Baseline Decision Tree Classifier -

Cost functions used for classification and regression:

In both cases the cost function tries to find the most homogeneous branches, i.e., branches whose groups have similar responses.

Regression : $\sum_i \left(y_i - \hat{y}_i\right)^2$

Classification : $G = \sum_k p_k \left(1 - p_k\right)$

A Gini score gives an idea of how good a split is by how mixed the response classes are in the groups created by the split. Here, $p_k$ is the proportion of inputs of class $k$ present in a particular group.

DecisionTreeClassifier

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Information gain uses the entropy measure as the impurity measure and splits a node such that it gives the most amount of information gain. Whereas Gini Impurity measures the divergences between the probability distributions of the target attribute’s values and splits a node such that it gives the least amount of impurity.

Gini : $\Large 1 - \sum^m_{j=1}P_j^2$

Entropy : $\Large -\sum^m_{j=1}P_j\cdot\log\left(P_j\right)$
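Both impurity measures are a few lines of NumPy; a small sketch over class proportions (using log base 2 for entropy, a common convention):

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_j^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum(p_j * log2(p_j)), skipping zero proportions."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]))     # 0.5 (maximally mixed binary node)
print(entropy([0.5, 0.5]))  # 1.0
print(gini([1.0, 0.0]))     # 0.0 (pure node)
```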

Feature importance formula

To calculate the importance of each feature, we consider each decision node together with its child nodes. The following formula covers the calculation of feature importance.

For each decision tree, scikit-learn calculates a node's importance using Gini importance, assuming only two child nodes (binary tree):

$\Large ni_j = w_jC_j - w_{left(j)}C_{left(j)} - w_{right(j)}C_{right(j)}$

Where

$ni_j$ = the importance of node $j$

$w_j$ = weighted number of samples reaching node $j$

$C_j$ = the impurity value of node $j$

$\text{left}(j)$ = child node from left split on node $j$

$\text{right}(j)$ = child node from right split on node $j$

The importance for each feature on a decision tree is then calculated as:

$\Large fi_i = \frac{\sum_{j\,:\,\text{node } j \text{ splits on feature } i}ni_j}{\sum_{k \,\in\, \text{all nodes}}ni_k}$

$fi_i$ is feature importance for $i^{th}$ feature

These can then be normalized to a value between 0 and 1 by dividing by the sum of all feature importance values:

$\Large normfi_i = \frac{fi_i}{\sum_{j \,\in\, \text{all features}}fi_j}$

Reference: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3

Decision Trees Parameters for classification

Baseline Random Forest Classifier -

Random Forest Parameters

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Feature Importance

Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node’s weight is equal to the number of training samples that are associated with it.
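A minimal sketch of reading these impurity-based importances off a fitted forest; the data is synthetic (the first feature fully determines the label, the second is noise), so the first importance should dominate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: feature 0 determines the label, feature 1 is noise
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = (X[:, 0] > 0.5).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# feature_importances_ are impurity-based and already normalized to sum to 1
print(rf.feature_importances_)  # first feature dominates
```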

Feature Importances of Random Forest Classifier

feature engineering for prevApp table

feature transformer for prevApp table

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
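Writing a file in that format is a one-liner with pandas; the probabilities below are the placeholder values from the example above, not real predictions:

```python
import pandas as pd

# Hypothetical predicted probabilities for each test-set SK_ID_CURR
submission = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "TARGET": [0.1, 0.9, 0.2],
})
# index=False keeps the file to exactly the two required columns
submission.to_csv("submission.csv", index=False)
```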

Kaggle submission via the command line API

Our Submission

[Screenshot: our Kaggle submission]

report submission

Click on this link

Abstract

The main goal of this project is to build an ML model that predicts whether a loan applicant will be able to repay their loan. We used the existing HCDR data from Kaggle to train our model, and began with EDA to understand and identify the important features available for training. We performed visual EDA on the tables to understand the meaning of each column, whether each column holds a numerical or categorical variable, and whether it is an independent or dependent variable. During preprocessing we imputed all missing values, using the median for numerical attributes and the most frequent value for categorical attributes. The data was then split into three parts: train, validation, and test. In modeling, we implemented pipelines to train our model with the following estimators: Logistic Regression, Decision Tree, and Random Forest, whose test accuracy scores are 91.93%, 85.46%, and 91.95% respectively. On the Kaggle submission, Logistic Regression gave the best accuracy score, 0.7372, so we will use it as the estimator for predicting whether an applicant will be able to repay a loan.

Introduction

Data Description:

The main table is divided into two files: Train (with TARGET) and Test (without TARGET).

bureau.csv: All past credit issued to the client by other financial institutions and reported to the Credit Bureau.

bureau_balance.csv: Monthly balances of previous credits in the Credit Bureau.

POS_CASH_balance.csv: Monthly balance snapshots of the applicant's prior POS (point of sale) and cash loans with Home Credit.

credit_card_balance.csv: Monthly balance snapshots of the applicant's prior credit cards with Home Credit.

previous_application.csv: All prior Home Credit loan applications of clients with loans in our sample.

installments_payments.csv: Payment history in Home Credit for previously disbursed credits related to the loans in our sample.

HomeCredit_columns_description.csv: The columns in the various data files are described in this file.

Analyzing a borrower’s background is crucial for banks because they have to make sure the borrower returns the loaned amount inside the allocated time interval. How can we solve this problem of identifying the borrower’s background quickly and effectively? We plan to tackle this by developing a machine learning pipeline where we use the data from the past borrowers and predict whether the borrower will be able to repay the loan or not and thus we will make use of classification and regression machine learning algorithms to aid our implementation.

The tasks to be tackled are:

[Figure: overview of the project tasks]

Pipelines


Here we created two different pipelines, for the numerical and categorical features respectively. We performed standardization and imputation on the numerical features, and imputation and one-hot encoding on the categorical features. We then combined the two pipelines using a ColumnTransformer and passed the result to the models.
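The combined pipeline can be sketched end to end as follows; the feature lists and toy data are illustrative stand-ins for the real application table:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

num_features = ["AMT_INCOME_TOTAL"]
cat_features = ["CODE_GENDER"]

# Route numerical columns through one sub-pipeline, categorical through the other
preprocess = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), num_features),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_features),
])

model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# Toy data standing in for application_train
X = pd.DataFrame({"AMT_INCOME_TOTAL": [100000.0, np.nan, 250000.0, 90000.0],
                  "CODE_GENDER": ["M", "F", np.nan, "F"]})
y = pd.Series([0, 1, 0, 1])
model.fit(X, y)
print(model.predict(X))
```

Wrapping preprocessing and estimator in one Pipeline means the same imputation and encoding are applied consistently at fit and predict time.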

In this pipeline we pass the combined data pipeline to the Logistic Regression model.

Here we are passing the data pipeline to the Decision Tree classifier

Here we are passing the data pipeline to the Random Forest classifier

Feature Engineering and transformers


We used Column Transformer to combine the numerical and categorical pipeline

We calculated the feature importances from the Decision Tree and Random Forest model

Experimental results

Discussion Experimental Results

After obtaining the results of the different machine learning algorithms, we can state that logistic regression and random forest have higher accuracy than the decision tree model.

Logistic regression gave the Testing accuracy of 91.93%

Decision Tree gave the Testing accuracy of 85.46%

Random Forest gave the Testing accuracy of 91.95%

The test ROC value for Logistic regression and Random Forest is 0.745 and 0.721 respectively.
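These ROC AUC values can be computed from validation labels and predicted probabilities with scikit-learn; a small sketch with made-up values:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical validation labels and predicted probabilities
y_val = [0, 0, 1, 1, 0, 1]
p_val = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
print(roc_auc_score(y_val, p_val))  # ~0.889
```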

After analyzing the ROC curves, we concluded that Logistic Regression and Random Forest performed similarly, whereas the Decision Tree performed poorly both in terms of testing accuracy and ROC value. The ROC value for the Decision Tree is 0.5389, which is barely better than random guessing. After making our Kaggle submission we found that Logistic Regression gives a better score, so we conclude that our best model is Logistic Regression.

Conclusion

The main purpose of this project is to create a machine learning model that can predict whether or not a loan applicant will be able to repay the loan. Many worthy applicants with no credit history or default history are getting rejected without any statistical analysis. The ML model we are creating is trained with the HCDR dataset. It will be able to predict whether an applicant will be able to repay a loan based on the history of similar applicants in the past. This would help in filtering applicants with good statistical backing derived from the various factors taken into consideration. This would help both a worthy applicant in securing a loan and the bank in growing its business further.

During training we used multiple estimators and concluded that Logistic Regression performs best, with a test accuracy of 91.93% and a Kaggle accuracy score of 0.7372. The modeling results give us confidence that the model will be able to successfully predict applicants' creditworthiness.

This is the first iteration of our model and subsequent iterations of the same are to be followed in the next phase. Moving on, we will be performing several iterations of Feature Engineering and Hyper-Parameter Tuning.

Kaggle Submission


[Screenshots: Kaggle submission details]

References

Some of the material in this notebook has been adapted from here

  1. Understanding AUC - ROC Curve
  2. Better Heatmaps and Correlation Matrix Plots in Python
  3. Bar Plots and Modern Alternatives
  4. Data Visualization using Matplotlib
  5. sklearn.ensemble.RandomForestClassifier

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools
